Summary

The complete 9193-nucleotide sequence of the prob- 
able causative agent of AIDS, lymphadenopathy-asso- 
ciated virus (LAV), has been determined. The deduced 
genetic structure is unique: it shows, in addition to the 
retroviral gag, pol, and env genes, two novel open 
reading frames we call Q and F. Remarkably, Q is located
between pol and env and F is half-encoded by
the U3 element of the LTR. These data place LAV apart 
from the previously characterized family of human 
T cell leukemia/lymphoma viruses. 

Introduction

The recent onset of severe opportunistic infections among 
previously healthy male homosexuals has led to the char- 
acterization of the acquired immune deficiency syndrome 
(AIDS) (Gottlieb et al., 1981; Masur et al., 1981). The disease
has spread dramatically, and new high-risk groups
have been identified: patients receiving blood products, 
intravenous drug addicts, and individuals originating from 
Haiti and Central Africa (Piot et al., 1984). AIDS is a fatal
disease, and there is at present no specific treatment. The 
causative agent was suspected to be of viral origin since 
the epidemiological pattern of AIDS was consistent with 
a transmissible disease, and cases had been reported af- 
ter treatment involving ultrafiltered anti-hemophilia prepa- 
rations (Daly and Scott, 1983). A decisive step in AIDS re- 
search was the discovery of a novel human retrovirus 
called lymphadenopathy-associated virus (LAV) (Barre- 
Sinoussi et al., 1983). The properties of the virus consis- 
tent with its etiological role in AIDS are: the recovery of 
many independent isolates from patients with AIDS or 
related diseases (Montagnier et al., 1984); high LAV
seropositivity among these populations (Brun-Vezinet et
al., 1984); a tropism and cytopathic effect in vitro for the
helper/inducer T-lymphocyte subset T4 (Klatzmann et al.,
1984), also found depleted in vivo.

Other groups have reported the isolation of human 
retroviruses, the human T cell leukemia/lymphoma/lymphotropic
virus type III (HTLV-III) (Popovic et al., 1984) and
the AIDS-associated retrovirus (ARV), which display biological
and sero-epidemiological properties very similar to
if not identical with those of LAV (Levy et al., 1984; Popovic
et al., 1984; Schupbach et al., 1984). Both LAV and HTLV-III
genomes have been molecularly cloned (Alizon et al.,
1984; Hahn et al., 1984). Their restriction maps show 
remarkable agreement, including a Hind III restriction site 
polymorphism, bearing in mind the variability of this virus 
(Shaw et al., 1984) and confirming that these two viruses
represent a single viral lineage. 

In addition to its obvious diagnostic and therapeutic 
potential, the LAV DNA nucleotide sequence is essential 
to an understanding of the genetics and molecular biology 
of the virus and its classification among retroviruses. We 
report here the complete 9193-nucleotide sequence of the 
LAV genome established from cloned proviral DNA. 

Results 

DNA Sequence and Organization of the LAV Genome 

We have reported previously the molecular cloning of both 
cDNA and integrated proviral forms of LAV (Alizon et al., 
1984). The recombinant phage clones were isolated from 
a genomic library of LAV-infected human T-lymphocyte 
DNA partially digested by Hind III. The insert of recombinant
phage λJ19 was generated by Hind III cleavage
within the R element of the long terminal repeat (LTR).
Thus each extremity of the insert contains one part of the
LTR. We have eliminated the possibility of clustered Hind
III sites within R by sequencing part of an LAV cDNA
clone, pLAV 75 (Alizon et al., 1984), corresponding to this
region (data not shown). Thus the total sequence information
of the LAV genome can be derived from the λJ19
clone.

Using the M13 shotgun cloning and dideoxy chain termination
method (Sanger et al., 1977), we have determined
the nucleotide sequence of the λJ19 insert. The reconstructed
viral genome with two copies of the R sequence
is 9193 nucleotides long. The numbering system starts at
the cap site (see below) of virion RNA (Figure 1).

The viral (+) strand contains the statutory retroviral 
genes encoding the core structural proteins (gag), reverse 
transcriptase (pol), and envelope protein (env), and two 
extra open reading frames (orf) that we call Q and F (Table
1). The genetic organization of LAV, 5' LTR-gag-pol-Q-env-F-3' LTR,
is unique. Whereas in all replication-competent
retroviruses pol and env genes overlap, in LAV they are
separated by orf Q (192 amino acids) followed by four
small (<100 triplets) orf. The orf F (206 amino acids)
slightly overlaps the 3' end of env and is remarkable in that
it is half-encoded by the U3 region of the LTR.
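
Open reading frames of this kind are found by scanning each reading frame for stretches free of stop codons. The following is a minimal sketch of such a scan on the (+) strand; the function name `open_reading_frames`, the `min_codons` threshold, and the toy sequence in the usage note are illustrative assumptions, not the programs actually used for this analysis.

```python
# Minimal sketch: report stop-free stretches in the three forward reading frames.
# Names and the default threshold are illustrative assumptions.
STOPS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=100):
    """Return (start, end, n_codons) tuples for stop-free stretches.

    start is the 1-based position of the first base of the first triplet;
    end is the 1-based position of the last base of the last triplet
    before the stop codon (or before the end of the sequence).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOPS:
                n_codons = (pos - run_start) // 3
                if n_codons >= min_codons:
                    found.append((run_start + 1, pos, n_codons))
                run_start = pos + 3  # next stretch begins after the stop
            pos += 3
        # Stretch still open when the sequence ends
        n_codons = (pos - run_start) // 3
        if n_codons >= min_codons:
            found.append((run_start + 1, pos, n_codons))
    return found
```

For example, `open_reading_frames("ATGAAACCCTAAGG", min_codons=3)` reports one stretch per frame on this toy input. The (-) strand would be scanned in the same way after taking the reverse complement.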

Such a structure clearly places LAV apart from previously
sequenced retroviruses (Figure 2). The (-) strand is
apparently noncoding. The additional Hind III site of the
LAV clone λJ81 (with respect to λJ19) maps to the apparently
noncoding region between Q and env (positions
5166-5745). Starting at position 5501 is a sequence
(AAGCQT) that differs by a single base (underlined) from
the Hind III recognition sequence. It is anticipated that
many of the restriction site polymorphisms between different
isolates will map to this region.
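
Sequences that lie one substitution away from a restriction site, and are therefore candidate polymorphic sites, can be listed with a simple sliding-window comparison. The sketch below assumes the Hind III recognition sequence AAGCTT; the function name `near_sites` and the toy input in the usage note are hypothetical and are not taken from the LAV sequence.

```python
# Minimal sketch: list windows within a given number of mismatches of a site.
HIND_III = "AAGCTT"  # Hind III recognition sequence


def near_sites(seq, site=HIND_III, max_mismatches=1):
    """Return (position, window, mismatches) with 1-based positions."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        # Count positions at which the window differs from the site
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits
```

On the toy input `"GGAAGCTTCCAAGGTTGG"` this reports an exact site at position 3 and a one-mismatch candidate at position 11.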

[Figure 1. Nucleotide sequence listing with deduced amino acid sequences of the open reading frames]


Figure 1. Complete DNA Sequence of Viral Genome (LAV-1a)

The sequence was reconstructed from the sequence of phage λJ19 insert. The numbering starts at the cap site, which was located experimentally
(see above). Important genetic elements, major open reading frames, and their predicted products are indicated together with the Hind III cloning
sites. The potential glycosylation sites in the env gene are overlined. The NH2-terminal sequence of p25gag determined by protein microsequencing
is boxed (Genetic Systems, personal communication).

Each nucleotide was sequenced on average 5.3 times: 85% of the sequence was determined on both strands and the remainder was sequenced
at least twice from independent clones. The base composition is T, 22.2%; C, 17.8%; A, 35.8%; G, 24.2%; G + C, 42%. The dinucleotide CpG
is greatly under-represented (0.9%) as is common among eukaryotic sequences (Bird, 1980).
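
The degree of CpG under-representation can be judged by comparing the observed dinucleotide frequency with the product of the C and G frequencies. The sketch below uses the base composition quoted above and assumes that the 0.9% figure is the observed frequency of CpG among all dinucleotides; the function and variable names are illustrative.

```python
# Minimal sketch: observed versus expected dinucleotide frequency.
# Assumption: the quoted 0.9% is the CpG frequency among all dinucleotides.


def dinucleotide_ratio(seq, dinuc="CG"):
    """Return (observed, expected) frequencies of a dinucleotide in seq."""
    seq = seq.upper()
    n = len(seq)
    pairs = n - 1
    observed = sum(1 for i in range(pairs) if seq[i:i + 2] == dinuc) / pairs
    # Expected frequency if neighbouring bases were independent
    expected = (seq.count(dinuc[0]) / n) * (seq.count(dinuc[1]) / n)
    return observed, expected


# Base composition quoted in the legend to Figure 1
composition = {"T": 0.222, "C": 0.178, "A": 0.358, "G": 0.242}
expected_cpg = composition["C"] * composition["G"]  # about 0.043
observed_cpg = 0.009
depletion = observed_cpg / expected_cpg  # about one fifth of expectation
```

Under these assumptions CpG occurs at roughly one fifth of the frequency expected from base composition alone.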



The LTR 

The organization of a reconstructed LTR and viral flanking
elements are shown schematically in Figure 3. The LTR is
638 bp long and displays usual features (Chen and Barker,
1984): it is bounded by an inverted repeat (5' ACTG) including
the conserved TG dinucleotide (Temin, 1981); adjacent
to 5' LTR is the tRNA primer binding site (PBS), complementary
to tRNA(Lys3) (Raba et al., 1979); adjacent to 3'
LTR is a perfect 15 bp polypurine tract. The other three
polypurine tracts observed between nucleotides
8200-8800 are not followed by a sequence that is complementary
to that just preceding the PBS.

The limits of U5, R, and U3 elements were determined
as follows. U5 is located between PBS and the polyadenylation
site established from the sequence of the 3' end of
oligo(dT)-primed LAV cDNA (Alizon et al., 1984). Thus U5
is 84 bp long. The length of R+U5 was determined by synthesizing
tRNA-primed LAV cDNA. After alkaline hydrolysis
of the primer, R+U5 was found to be 181 ±1 bp (Figure 4).
Thus R is 97 bp long and the cap site at its 5' end
can be located. Finally, U3 is 456 bp long. The LAV LTR
also contains characteristic regulatory elements: a polyadenylation
signal sequence AATAAA 19 bp from the R-U5
junction, and the sequence ATATAAG, which is very likely
the TATA box, 22 bp 5' of the cap site. There are no long
direct repeats within the LTR. Interestingly, the LAV LTR
shows some similarities to that of the mouse mammary tumor
virus (MMTV) (Donehower et al., 1981). They both use
tRNA(Lys3) as a primer for (-) strand synthesis, whereas all
other exogenous mammalian retroviruses known to date
use tRNA(Pro) (Chen and Barker, 1984). They possess very
similar polypurine tracts; that of LAV is AAAAGAAAAGGGGGG
while that of MMTV is AAAAAAGAAAAAAGGGGG.
It is probable that the viral (+) strand synthesis is discontinuous
since the polypurine tract flanking the U3 element
of the 3' LTR is found exactly duplicated in the 3' end of orf
pol, at 4331-4346. In addition, MMTV and LAV are exceptional
in that the U3 element can encode an orf. In the
case of MMTV, U3 contains the whole orf while, in LAV, U3
contains 110 codons of the 3' half of orf F.

Table 1. Locations and Sizes of Viral Open Reading Frames

         1st Triplet     Met     Stop    No. Amino Acids    Mr (calc)
gag           312        336    1,836         500             55,841
pol         1,631      1,934    4,640       (1,003)         (113,629)
orf Q       4,554      4,587    5,163         192             22,487
env         5,746      5,767    8,350         861             97,376
orf F       8,324      8,354    8,972         206             23,316

The nucleotide coordinates refer to the first base of the first triplet (1st triplet), of the first methionine (initiation) codon (Met) and of the stop codon
(Stop). The numbers of amino acids and molecular weights are those calculated for unmodified precursor products starting at the first methionine
through to the end, with the exception of pol, where the size and Mr refer to that of the whole orf.

Figure 2. Comparison of the Genome Organization of LAV with Those
of Human T Cell Leukemia/Lymphoma Virus Type I (HTLV-I) (Seiki et
al., 1983), Moloney Murine Leukemia Virus (MoMuLV) (Shinnick et al.,
1981), and Rous Sarcoma Virus (RSV) (Schwartz et al., 1983)
The positions and sizes of viral genes are drawn to scale (open boxes)
and the viral genomes (RNA forms) are delimited by brackets.
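
The coordinates in Table 1 can be cross-checked against the stated numbers of amino acids, since the distance from the first base of the initiation codon to the first base of the stop codon must be a multiple of three. The sketch below does this arithmetic; the dictionary layout and the function name `codons_between` are illustrative assumptions, and for pol the count runs from the first triplet, as stated in the footnote to the table.

```python
# Minimal sketch: recompute orf sizes from the coordinates in Table 1.
# Each entry is (first triplet, first Met codon, stop codon), first base of each.
ORFS = {
    "gag": (312, 336, 1836),
    "pol": (1631, 1934, 4640),
    "orf Q": (4554, 4587, 5163),
    "env": (5746, 5767, 8350),
    "orf F": (8324, 8354, 8972),
}


def codons_between(start, stop):
    """Number of triplets from `start` up to, but not including, `stop`."""
    length = stop - start
    if length % 3:
        raise ValueError("coordinates are not in the same reading frame")
    return length // 3


sizes = {}
for name, (first, met, stop) in ORFS.items():
    # pol is counted over the whole orf; the others start at the first Met
    origin = first if name == "pol" else met
    sizes[name] = codons_between(origin, stop)
```

The recomputed sizes are 500, 1003, 192, 861, and 206 codons, in agreement with the table. The same coordinates show that orf F begins (8324) before env ends (8350), which is the slight overlap noted in the text.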

Viral Proteins 
gag 

Near the 5' extremity of the gag orf is a "typical" initiation
codon (Kozak, 1984) (position 336), which is not only the
first in the gag orf, but the first from the cap site. The
precursor protein is 500 amino acids long. The calculated
Mr of 55,841 agrees with the 55 kd gag precursor polypeptide
(Luc Montagnier, unpublished results). The N-terminal
amino acid sequence of the major core protein
p25, obtained by microsequencing (Genetic Systems, personal
communication), matches perfectly with the translated
nucleotide sequence starting from position 732 (see
Figure 1). This formally makes the link between the cloned
LAV genome and the immunologically characterized LAV
p25 protein. The protein encoded 5' of the p25 coding sequence
is rather hydrophilic. Its calculated Mr of 14,866 is
consistent with that of the gag protein p18. The 3' part of
the gag region probably codes for the retroviral nucleic
acid binding protein (NBP). Indeed, as in HTLV-I (Seiki et
Figure 3. Schematic Representation of the LAV Long Terminal Repeat
(LTR)

The LTR was reconstructed from the sequence of λJ19 by juxtaposing
the sequences adjacent to the Hind III cloning sites. Sequencing of
oligo(dT)-primed LAV cDNA clone pLAV75 (Alizon et al., 1984) rules out
the possibility of clustered Hind III sites in the R region of LAV. LTR are
limited by an inverted repeat sequence (IR). Both of the viral elements
flanking the LTR have been represented as tRNA primer binding site
(PBS) for 5' LTR and polypurine tract (PU) for 3' LTR. Also indicated
are a putative TATA box, the cap site, polyadenylation signal (AATAAA),
and polyadenylation site (CAA). The location of the open reading frame
F (648 nucleotides) is shown above the LTR scheme.

al., 1983) and RSV (Schwartz et al., 1983), the motif
Cys-X2-Cys-X8-9-Cys common to all NBP (Oroszlan et al., 1984)
is found duplicated (nucleotides 1509 and 1572 in LAV sequence).
Consistent with its function the putative NBP is
extremely basic (17% Arg + Lys).
pol 

The reverse transcriptase gene can encode a protein of up 
to 1003 amino acids (calculated Mr = 113,629). Since the
first methionine codon is 92 triplets from the origin of the 
open reading frame, it is possible that the protein is trans- 
lated from a spliced messenger RNA, giving a gag-pol 
polyprotein precursor. 

The pol coding region is the only one in which signifi- 
cant homology has been found with other retroviral protein 
sequences, three domains of homology being apparent. 
The first is a very short region of 17 amino acids (starting 
at 1856). Homologous regions are located within the p15
gag RSV protease (Dittmar and Moelling, 1978) and a polypeptide
encoded by an open reading frame located be-
tween gag and pol of HTLV-I (Figure 5) (Schwartz et al., 
1983; Seiki et al., 1983). This first domain could thus cor- 
respond to a conserved sequence in viral proteases. Its 
different locations within the three genomes may not be 
significant since retroviruses, by splicing or other mecha- 
nisms, express a gag-pol polyprotein precursor (Schwartz 
et al., 1983; Seiki et al., 1983). The second and most ex- 
tensive region of homology (starting at 2048) probably 
represents the core sequence of the reverse transcrip- 
tase. Over a region of 250 amino acids, with only minimal 
insertions or deletions, LAV shows 38% amino acid iden- 
tity with RSV, 25% with HTLV-I, and 21% with MoMuLV 
(Shinnick et al., 1981) while HTLV-I and RSV show 38%
identity in the same region. A third homologous region is 
situated at the 3' end of the pol reading frame and corre- 
sponds to part of the pp32 peptide of RSV that has ex- 
onuclease activity (Misra et al., 1982). Once again, there 
is greater homology with the corresponding RSV se- 
quence than with HTLV-I. 
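
Percentages of amino acid identity such as those quoted above are computed over aligned positions. The sketch below shows one common convention, in which columns containing a gap in either sequence are excluded from the count; this convention, the function name `percent_identity`, and the toy sequences in the usage note are assumptions, since the text does not state how insertions and deletions were scored.

```python
# Minimal sketch: percent identity between two pre-aligned sequences.
# Assumption: columns with a gap in either sequence are not counted.


def percent_identity(a, b, gap="-"):
    """Return percent identity over ungapped columns of two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip insertion/deletion columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For instance, `percent_identity("MKV-LT", "MRVALT")` compares five ungapped columns, four of which match.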
env

The env open reading frame has a possible initiator 
methionine codon very near the beginning (eighth triplet). 
Figure 4. Synthesis of RNA-Primed LAV cDNA for R+U5 (Strong-Stop
cDNA)

Lanes 1 and 2 show two different quantities of cDNA while lanes M and
M' represent markers. The strong-stop cDNA is 181 bases long with a
second, less intense band at 180. The error of estimation is ±1 bp. This
maps the major cap site to the second G residue of the sequence
CTCGGTCT within the LTR, 24 nucleotides downstream of the TATA
box. This guanosine residue is taken as the first base in the nucleotide
sequence shown in Figure 1.

If so, the molecular weight of the presumed env precursor
protein (861 amino acids, Mr calc = 97,376) is consistent
with the known size of the LAV glycoprotein (110 kd and
90 kd after glycosidase treatment; Luc Montagnier, unpublished).
There are 32 potential N-glycosylation sites (Asn-X-Ser/Thr),
which are overlined in Figure 1. An interesting
feature of env is the very high number of Trp residues at
both ends of the protein. There are three hydrophobic
regions, characteristic of the retroviral envelope proteins
(Seiki et al., 1983), corresponding to a signal peptide (encoded
by nucleotides 5815-5850 bp), a second region
(7315-7350 bp), and a transmembrane segment (7831-
7896 bp). The second hydrophobic region (7315-7350 bp)
is preceded by a stretch rich in Arg + Lys. It is possible
that this represents a site of proteolytic cleavage, which,
by analogy with other retroviral proteins, would give an external
envelope polypeptide and a membrane-associated
protein (Seiki et al., 1983; Kiyokawa et al., 1984). A striking
feature of the LAV envelope protein sequence is that the
region following the transmembrane segment is of unusual
length (150 residues). The env protein shows no
Figure 5. Location of a Short Stretch of Homology in the gag-pol Region
of the LAV, HTLV-I (Seiki et al., 1983) and RSV (Schwartz et al.,
1983) Genomes

Conserved amino acids are boxed. Homologous region is shown by
the solid bar in the schema. Each virus is organized differently in this
region but the sequence in the RSV genome maps to p15gag, which
has a protease-associated function.

homology to any sequence in protein data banks. The
small amino acid motif common to the transmembrane
proteins of all leukemogenic retroviruses (Cianciolo et al.,
1984) is not present in LAV env.
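
Potential N-glycosylation sites of the form Asn-X-Ser/Thr can be located in a translated sequence with a short pattern search. The sketch below applies the motif exactly as written in the text, with X standing for any residue; the function name `n_glycosylation_sites` and the toy peptide in the usage note are illustrative assumptions. Many current tools additionally exclude proline at the X position, which this sketch does not do.

```python
# Minimal sketch: find Asn-X-Ser/Thr sequons in a one-letter protein sequence.
import re

# Lookahead so that overlapping sequons (e.g. N-N-S-S) are all reported
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def n_glycosylation_sites(protein):
    """Return 1-based positions of the Asn in each Asn-X-Ser/Thr motif."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]
```

On the toy peptide `"MNGTNNSSANPT"` the function returns positions 2, 5, 6, and 10.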
Q and F

The location of orf Q is without precedent in the structure 
of retroviruses. Orf F is unique in that it is half-encoded 
by the U3 element of the LTR. Both orf have strong initiator 
codons (Kozak, 1984) near their 5' ends and can encode 
proteins of 192 amino acids (Mr calc = 22,487) and 206
amino acids (Mr calc = 23,316), respectively. Both putative
proteins are hydrophilic (pQ 49% polar, 15.1% Arg +
Lys; pF 46% polar, 11% Arg + Lys) and are therefore unlikely
to be associated directly with membrane. The function
for the putative proteins pQ and pF cannot be
predicted, as no homology was found by screening protein
sequence data banks. Between orf F and the pX protein
of HTLV-I there is no detectable homology. Furthermore,
their hydrophobicity/hydrophilicity profiles are
completely different. It is known that retroviruses can
transduce cellular genes, notably proto-oncogenes
(Weinberg, 1982). We suggest that orfs Q and F represent
exogenous genetic material and not some vestige of cellular
DNA because LAV DNA does not hybridize to the human
genome under stringent conditions (Alizon et al.,
1984), and their codon usage is comparable to that of the
gag, pol, and env genes (data not shown).
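
Composition figures such as the percentage of Arg + Lys are simple residue counts over the translated sequence. The sketch below computes such a percentage from a one-letter protein sequence; the function name `percent_residues` and the toy peptide in the usage note are hypothetical, and the set of residues counted as "polar" is not defined in the text, so only the Arg + Lys case is shown.

```python
# Minimal sketch: percentage of selected residues in a protein sequence.


def percent_residues(protein, residues="RK"):
    """Return the percentage of positions occupied by any of `residues`."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    count = sum(protein.count(r) for r in residues)
    return 100.0 * count / len(protein)
```

For the toy peptide `"MKRAAKLLGG"`, three of ten residues are Arg or Lys.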

Relationship to Other Retroviruses 

Although LAV is both morphologically and biochemically
(Barre-Sinoussi et al., 1983) distinct from HTLV-I and -II, it remained
possible that its genome was organized in a similar
manner. The characteristic features of HTLV-I and -II
genomes, which they share with the more distantly related
bovine leukemia virus (BLV) (Rice et al., 1984), are not
observed in the case of LAV. These are: a region 3' of
the envelope gene consisting of a noncoding stretch
(600-900 bp), followed by a coding sequence of 307-357
codons (X open reading frame), which may slightly overlap
the U3 region of the LTR (Seiki et al., 1983; Rice et al.,
1984; Sagata et al., 1984) and, second, the LTR being
Table 2. Comparison of the Size of the LAV LTR and LTR-Related
Elements to Those of Other Retroviruses

            LTR      U3      R     U5     PU    PBS    IR
LAV         638     456     97     85     15    LYS     4
HTLV-I      759     355    228    176    12*    PRO    4*
HTLV-II     763     314    248    261    12*    PRO    4*
MMTV      1,332   1,197     11    124     19    LYS    8*
MoMuLV      594     449     68     77     13    PRO    13
RSV         335     234     21     80     11    TRP    15
SNV         601     420     97     80     13    PRO     9

Adapted from Chen and Barker (1984).
* = imperfect match or tract.
SNV = spleen necrosis virus (Shimotohno and Temin, 1982).




composed of unusually long U5 and R elements and the
polyadenylation signal being situated in U3 instead of R
(Seiki et al., 1983; Sagata et al., 1984; Shimotohno et al.,
1984). We show here that, in contrast, the 3' end of the LAV
envelope gene overlaps an open reading frame, termed F,
that has the coding capacity for 206 amino acids and extends
within the LTR (110 amino acids are encoded by the U3
region). The putatively encoded polypeptide (pF), the primary
structure of which can be deduced, does not show
any homology with the theoretical X gene products of the
HTLV/BLV family. Also, the U5 and R elements are shorter
(Table 2) and the polyadenylation signal is located within R,
as is the case for all retroviruses except the HTLV/BLV. Additionally,
LAV uses tRNA(Lys) as (-) strand primer, as opposed
to tRNA(Pro) employed by all other mammalian retroviruses
except MMTV (Donehower et al., 1981). Those
homologies detected between the polymerase and protease
domains of LAV and HTLV are also found in several
retroviruses, RSV in particular.

It has been reported that a cloned HTLV-III genome
hybridizes (T m = 28°C) to sequences in the gag-pol and
X regions of HTLV-I and -II; although restriction maps of
cloned LAV and HTLV-III show almost perfect agreement
(Hahn et al., 1984), we were unable to detect any such
hybridization between LAV and HTLV-II (T m = 55°C)
(Alizon et al., 1984). Indeed, there is a punctual region of
homology between LAV and HTLV-I (23/27 nucleotides
starting at position 1859 in the LAV sequence) but nothing
significant between the two viruses in the X region of
HTLV-I. One possible reason for this discrepancy is that
HTLV-III is subtly different from LAV. However, it was subsequently
reported that there was very minimal, if any, homology
between orf X (of HTLV-I) and HTLV-III (Shaw et al.,
1984).

Discussion 

Regulatory sequences carried by retroviral LTR are be- 
lieved to be involved in specific interactions between the 
viral genome and the host cell (Srinivasan et al., 1984). 
The LTR sequences of LAV are unique among retrovi- 
ruses. That could reflect an original mode of gene ex- 
pression, possibly in relation to particular transcriptional 
factors present in the virus-harboring cell. This hypothesis 
can be tested by studying the regulatory activity of the LAV
LTR sequences in transient or long-term experiments involving
an indicator gene and different cellular contexts.

The presence of the Q and F reading frames in addition 
to the conventional gag-pol-env set of genes is unex- 
pected. One should now address the question of their role 
in the viral cycle and pathogenicity by trying to character- 
ize their protein product(s). It is tempting to speculate on 
a role of such polypeptide(s) in T4 cells' mortality, a problem
that can be studied by designing synthetic peptides
for antibody production or by using site-directed mutagen- 
esis of Q and F coding regions. 

The peculiar genetic structure of LAV poses the question
of its origin. The virus shares common traits with other
(apparently unrelated) retroviruses. For instance, the unusually
large size of the outer membrane glycoprotein
(env) and a comparably sized genome are also observed
in the case of lentiviruses such as Visna (Harris et al.,
1981; Querat et al., 1984). The presence of a large part of
the F open reading frame in the LTR, and the use of
tRNA(Lys) as a primer for (-) strand synthesis, is reminiscent
of the mouse mammary tumor virus. On the other
hand, homologies in the pol gene would suggest that the
LAV is closer to RSV than to any other retroviruses. Obviously,
no clear picture can be drawn from the DNA sequence
analysis as far as phylogeny is concerned. Thus,
it may well be that LAV defines a new group of retroviruses 
that have been independently evolving for a considerable 
period of time, and not simply a variant recently derived 
from a characterized viral family. Both epidemiology and 
pathogeny of AIDS should be reconsidered with this idea 
in mind, when trying to answer such questions as these: 
Are there other human or animal diseases that are as- 
sociated with similarly organized viruses? Is there a precur- 
sor to AIDS-associated virus(es) normally present, in la- 
tent form, in human populations? What triggered in this 
case the recent spreading of pathogenic derivatives? 

Experimental Procedures 

M13 Cloning and Sequencing 

Total λJ19 DNA was sonicated, treated with the Klenow fragment of
DNA polymerase plus deoxyribonucleotides (2 hr, 16°C), and fractionated
by agarose gel electrophoresis. Fragments of 300-600 bp were
excised, electroeluted, and purified by Elutip (Schleicher and Schuell)
chromatography. DNA was ethanol-precipitated using 10 µg dextran
T40 (Pharmacia) as carrier and ligated to dephosphorylated, Sma I-cleaved
M13mp8 RF DNA using T4 DNA and RNA ligases (16 hr, 16°C)
and transfected into E. coli strain TG-1. Recombinant clones were detected
by plaque hybridization using the appropriate 32P-labeled LAV
restriction fragments as probes. Single-stranded templates were prepared
from plaques exhibiting positive hybridization signals and were
sequenced by the dideoxy chain termination procedure (Sanger et al.,
1977) using α-35S-dATP (Amersham, 400 Ci/mmol) and buffer gradient
gels (Biggin et al., 1983). Sequences were compiled and analyzed
using the programs of Staden adapted by a Caudron for the Institut
Pasteur Computer Center (Staden, 1982).

Strong-Stop cDNA

LAV virions from infected T lymphocyte (Barre-Sinoussi et al., 1983)
culture supernatant were pelleted through a 20% sucrose cushion and
the cDNA (-) strand was synthesized as described previously (Alizon
et al., 1984) except that no exogenous primer was used. After alkaline
hydrolysis (0.3 M NaOH, 30 min, 65°C), neutralization, and phenol extraction,
the cDNA was ethanol-precipitated and loaded onto a 6%
acrylamide/8 M urea sequencing gel with sequence ladders as size
markers.